Welcome to the Exploratory Data Analysis of the CRAN historical data set. As you may already know, CRAN is a network of servers around the world that stores code and documentation for R packages over time. As of writing this EDA, CRAN had just over 18,000 packages available in its repository.
Heads or Tails has done a great job of grabbing historical data, cleaning it up and preparing it for us R enthusiasts. Read about the approach he followed in his blogpost.
Read through the initial setup in the 4 tabs below.
First, I import some useful libraries and set some plotting defaults.
# Data Manipulation
library(dplyr)
library(tidyr)
library(readr)
library(skimr)
library(purrr)
library(stringr)
library(urltools)
# Plots
library(ggplot2)
library(naniar)
library(packcircles)
library(ggridges)
# Tables
library(reactable)
# Settings
theme_set(theme_minimal(
base_size = 14
))

Let’s start by reading in the data. There are two CSV files in this dataset. From his dataset page:

cran_package_overview.csv: all R packages currently available through CRAN, with (usually) 1 row per package…

cran_package_history.csv: version history of virtually all packages in the previous table…

hist_dt <- read_csv(
"../input/cran_package_history.csv",
col_types = cols(
package = col_character(),
version = col_character(),
date = col_date(format = "%Y-%m-%d"),
repository = col_character()
)
)
ov_dt <- read_csv(
"../input/cran_package_overview.csv",
col_types = cols(
package = col_character(),
version = col_character(),
depends = col_character(),
imports = col_character(),
license = col_character(),
needs_compilation = col_logical(),
author = col_character(),
bug_reports = col_character(),
url = col_character(),
date_published = col_date(format = "%Y-%m-%d"),
description = col_character(),
title = col_character()
)
))

I love taking a first peek at a dataset with the amazing {skimr} package. We can see that the right data types are set for all the columns, and dates have been imported correctly.
We can see in the history data that the first package reported on CRAN dates back over 24 years, to 1998-02-25!
Furthermore, the overview tells us there’s a package {pack} last published/updated on 2008-09-08.
While there’s no missing data in the history dataset, there are a bunch of missing values in the overview dataset. Let’s explore this a bit more.
skimr::skim(hist_dt)

── Data Summary ────────────────────────
Values
Name hist_dt
Number of rows 119464
Number of columns 4
_______________________
Column type frequency:
character 3
Date 1
________________________
Group variables None
── Variable type: character ─────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 package 0 1 2 32 0 18372 0
2 version 0 1 3 15 0 10074 0
3 repository 0 1 4 7 0 2 0
── Variable type: Date ──────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max median n_unique
1 date 0 1 1998-02-25 2022-07-17 2018-01-22 7119
skimr::skim(ov_dt)

── Data Summary ────────────────────────
Values
Name ov_dt
Number of rows 18388
Number of columns 12
_______________________
Column type frequency:
character 10
Date 1
logical 1
________________________
Group variables None
── Variable type: character ─────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 package 0 1 2 32 0 18371 0
2 version 0 1 3 14 0 2112 0
3 depends 4862 0.736 2 218 0 4289 0
4 imports 4026 0.781 2 573 0 12041 0
5 license 0 1 3 54 0 159 0
6 author 0 1 4 4096 0 15652 0
7 bug_reports 10760 0.415 11 81 0 7578 0
8 url 8426 0.542 4 466 0 9548 0
9 description 0 1 5 7372 0 18368 0
10 title 0 1 5 210 0 18306 0
── Variable type: Date ──────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max median n_unique
1 date_published 1 1.00 2008-09-08 2022-07-17 2021-03-22 2542
── Variable type: logical ───────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean count
1 needs_compilation 0 1 0.239 FAL: 13991, TRU: 4397
My favorite way of exploring missing data is to make it visible, using Nick Tierney’s amazing {naniar} package. There are a few columns with missing data. Let’s look at these more closely.

depends and imports have roughly a quarter of the data as <NA>. These are packages which simply have no external dependencies. The difference between the two can get a bit complex; best to learn about it in Hadley’s chapter here.

bug_reports and url are also frequently missing. These don’t seem to be data issues so much as authors who don’t have a place to file bugs or a website for their package, respectively.

date_published has only 1 missing row, which looks like a small data quality slip.

ov_dt |>
  dplyr::arrange(date_published) |>
  vis_miss()

Since this is an open-ended exploration (unlike an EDA whose purpose is building a predictive model), before I continue to the plotting I’d like to posit some questions to guide the flow of further work. The first five questions are from Martin’s blog, followed by further questions which I think would be interesting to explore.
To aid answering many of these, I first need to create a few new
features in the overview data set.
Read about the feature development in the tabs below. We go from
12 columns to 29 columns in the overview data set.
Per the R package section in Hadley’s book,
“an R package version is a sequence of at least two integers
separated by either . or -. For example,
1.0 and 0.9.1-10 are valid versions, but
1 and 1.0-devel are not”. Typically,
packages do follow the three number format of
<major>.<minor>.<patch>. I’m making an
assumption this is true, just to simplify things. I have a feeling it’ll
capture most of the cases.
This feature could help answer questions about version number progressions.
split_versions <- function(dat) {
stopifnot("version" %in% names(dat))
dat |>
separate(
version,
into =
c("major", "minor", "patch"),
sep = "\\.",
extra = "merge", # for versions like 1.0.3-3000, keep the '3-3000' together in the 3rd col
fill = "right",
remove = FALSE
)
}
ov_dt <- ov_dt |> split_versions()
hist_dt <- hist_dt |> split_versions()
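As a quick sanity check of the separation rules (a standalone sketch using the same tidyr::separate() arguments as split_versions() above, on made-up version strings):

```r
library(tibble)
library(tidyr)

# Mini-example: how separate() handles the edge cases discussed above
tibble(version = c("1.0.0", "0.9.1-10", "1.0", "1.0.3-3000")) |>
  separate(
    version,
    into = c("major", "minor", "patch"),
    sep = "\\.",
    extra = "merge",  # "1.0.3-3000" keeps "3-3000" together in patch
    fill = "right",   # "1.0" has no third component, so patch is NA
    remove = FALSE
  )
```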
head(hist_dt) |>
  reactable(compact = TRUE)

For the last published version of each package, how many dependencies and imports does it have? My hypothesis is that packages in the past relied on fewer dependencies, since they were more likely than not written in base R. With the recent explosion in R adoption, and the adoption of the tidyverse framework, more recent packages would carry a larger set of dependencies.
ov_dt <- ov_dt |>
mutate(
# Dependencies
num_dep = purrr::map_int(
.x = depends,
.f = function(x){
x |>
stringr::str_split(",", simplify = TRUE) |>
length()
}
),
num_dep = ifelse(is.na(depends), 0, num_dep),
# Imports
num_imports = purrr::map_int(
.x = imports,
.f = function(x){
x |>
stringr::str_split(",", simplify = TRUE) |>
length()
}
),
num_imports = ifelse(is.na(imports), 0, num_imports)
)
glimpse(ov_dt, 100)

Rows: 18,388
Columns: 17
$ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
$ num_dep <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
$ num_imports <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
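The map_int() loops above work, but the same counts can be had with a vectorized one-liner. This is a sketch assuming dependency lists are plain comma-separated strings; count_items is a hypothetical helper, not part of the dataset code:

```r
library(dplyr)
library(stringr)
library(tibble)

# Count comma-separated entries; NA means zero dependencies
count_items <- function(x) {
  if_else(is.na(x), 0L, str_count(x, ",") + 1L)
}

tibble(depends = c(NA, "R (>= 3.6.0)", "dplyr, tidyr, purrr")) |>
  mutate(num_dep = count_items(depends))
# num_dep: 0, 1, 3
```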
Next, some temporal features, typically useful for aggregation downstream.
hist_dt <- hist_dt |>
mutate(
year = lubridate::year(date),
month = lubridate::month(date, label = TRUE),
day = lubridate::day(date),
wday = lubridate::wday(date, label = TRUE),
yr_mon = sprintf("%d-%s", year, month),
dt = lubridate::ym(paste0(year, "-", month))
)
ov_dt <- ov_dt |>
filter(!is.na(date_published)) |>
mutate(
year = lubridate::year(date_published),
month = lubridate::month(date_published, label = TRUE),
day = lubridate::day(date_published),
wday = lubridate::wday(date_published, label = TRUE),
yr_mon = sprintf("%d-%s", year, month),
dt = lubridate::ym(paste0(year, "-", month))
)
glimpse(ov_dt, 100)

Rows: 18,387
Columns: 24
$ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
$ num_dep <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
$ num_imports <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
$ num_authors <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
$ year <dbl> 2015, 2020, 2019, 2021, 2019, 2022, 2015, 2016, 2019, 2017, 2022, 2017, …
$ month <ord> Aug, Jun, Sep, Dec, Jun, May, May, Oct, Nov, Mar, May, Nov, Feb, May, Ju…
$ day <int> 16, 14, 20, 14, 25, 19, 5, 20, 13, 13, 28, 6, 4, 28, 17, 3, 20, 30, 22, …
$ wday <ord> Sun, Sun, Fri, Tue, Tue, Thu, Tue, Thu, Wed, Mon, Sat, Mon, Thu, Thu, Tu…
$ yr_mon <chr> "2015-Aug", "2020-Jun", "2019-Sep", "2021-Dec", "2019-Jun", "2022-May", …
$ dt <date> 2015-08-01, 2020-06-01, 2019-09-01, 2021-12-01, 2019-06-01, 2022-05-01,…
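As an aside, the dt column (a date snapped to the first of its month) can also be built directly from the date with lubridate::floor_date(), rather than reassembling it from year and month. A sketch, not the code used above:

```r
library(lubridate)

# floor_date() snaps any date down to the start of the chosen unit
floor_date(as.Date("2015-08-16"), unit = "month")
# 2015-08-01
```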
How long are the titles and description fields in the latest package submissions? Any interesting trends over time?
ov_dt <- ov_dt |>
mutate(
len_title = purrr::map_int(title, ~ stringr::str_count(.x, "\\w+")),
len_desc = purrr::map_int(description, ~ stringr::str_count(.x, "\\w+"))
)
glimpse(ov_dt, 100)

Rows: 18,387
Columns: 26
$ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
$ num_dep <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
$ num_imports <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
$ num_authors <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
$ year <dbl> 2015, 2020, 2019, 2021, 2019, 2022, 2015, 2016, 2019, 2017, 2022, 2017, …
$ month <ord> Aug, Jun, Sep, Dec, Jun, May, May, Oct, Nov, Mar, May, Nov, Feb, May, Ju…
$ day <int> 16, 14, 20, 14, 25, 19, 5, 20, 13, 13, 28, 6, 4, 28, 17, 3, 20, 30, 22, …
$ wday <ord> Sun, Sun, Fri, Tue, Tue, Thu, Tue, Thu, Wed, Mon, Sat, Mon, Thu, Thu, Tu…
$ yr_mon <chr> "2015-Aug", "2020-Jun", "2019-Sep", "2021-Dec", "2019-Jun", "2022-May", …
$ dt <date> 2015-08-01, 2020-06-01, 2019-09-01, 2021-12-01, 2019-06-01, 2022-05-01,…
$ len_title <int> 9, 9, 8, 3, 8, 6, 8, 6, 9, 3, 5, 7, 7, 7, 4, 5, 5, 3, 4, 4, 5, 3, 8, 11,…
$ len_desc <int> 28, 24, 40, 21, 52, 36, 11, 32, 41, 163, 19, 56, 33, 89, 12, 25, 88, 65,…
The raw dataset has 159 unique levels for the license
variable.
ov_dt |>
count(license) |>
reactable(compact = TRUE)

But with many of them being quite similar to each other, some binning is in order to extract some patterns. Here, I use case_when to bin together similar licenses. (I’m no expert in these licenses; I’m sure I’m taking some liberties in the grouping here.)
ov_dt <- ov_dt |>
mutate(
license_cleaned = case_when(
str_detect(license, "^GPL-3") ~ "GPL-3",
str_detect(license, "^GPL\\s\\([\\s\\d\\.<=>]*3") ~ "GPL-3",
str_detect(license, "^GPL-2") ~ "GPL-2",
str_detect(license, "^GPL\\s\\([\\s\\d\\.<=>]*2") ~ "GPL-2",
str_detect(license, "^AGPL") ~ "AGPL",
str_detect(license, "^LGPL") ~ "LGPL",
str_detect(license, "Apache") ~ "Apache",
str_detect(license, "BSD") ~ "BSD",
str_detect(license, "LGPL") ~ "LGPL",
str_detect(license, "MIT") ~ "MIT",
str_detect(license, "CC0") ~ "CC0",
license == "GPL" ~ "GPL",
TRUE ~ "Other"
# str_detect(license, "GNU") ~ "GNU", # Left these out after some trials with plots below
# str_detect(license, "MPL") ~ "MPL",
# str_detect(license, "Unlimited") ~ "Unlimited",
# str_detect(license, "^CC") ~ "CC",
)
)
glimpse(ov_dt, 100)

Rows: 18,387
Columns: 27
$ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
$ num_dep <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
$ num_imports <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
$ num_authors <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
$ year <dbl> 2015, 2020, 2019, 2021, 2019, 2022, 2015, 2016, 2019, 2017, 2022, 2017, …
$ month <ord> Aug, Jun, Sep, Dec, Jun, May, May, Oct, Nov, Mar, May, Nov, Feb, May, Ju…
$ day <int> 16, 14, 20, 14, 25, 19, 5, 20, 13, 13, 28, 6, 4, 28, 17, 3, 20, 30, 22, …
$ wday <ord> Sun, Sun, Fri, Tue, Tue, Thu, Tue, Thu, Wed, Mon, Sat, Mon, Thu, Thu, Tu…
$ yr_mon <chr> "2015-Aug", "2020-Jun", "2019-Sep", "2021-Dec", "2019-Jun", "2022-May", …
$ dt <date> 2015-08-01, 2020-06-01, 2019-09-01, 2021-12-01, 2019-06-01, 2022-05-01,…
$ len_title <int> 9, 9, 8, 3, 8, 6, 8, 6, 9, 3, 5, 7, 7, 7, 4, 5, 5, 3, 4, 4, 5, 3, 8, 11,…
$ len_desc <int> 28, 24, 40, 21, 52, 36, 11, 32, 41, 163, 19, 56, 33, 89, 12, 25, 88, 65,…
$ license_cleaned <chr> "GPL-2", "GPL-3", "GPL-3", "GPL-3", "MIT", "GPL-3", "GPL-3", "GPL-3", "G…
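To sanity-check the regex rules, here is a hypothetical standalone helper restating a few of the case_when branches above (only a subset; the remaining rules follow the same pattern):

```r
library(dplyr)
library(stringr)

# Hypothetical helper reusing a few of the binning rules above
bin_license <- function(license) {
  case_when(
    str_detect(license, "^GPL-3") ~ "GPL-3",
    str_detect(license, "^GPL\\s\\([\\s\\d\\.<=>]*3") ~ "GPL-3",
    str_detect(license, "^GPL-2") ~ "GPL-2",
    str_detect(license, "^GPL\\s\\([\\s\\d\\.<=>]*2") ~ "GPL-2",
    str_detect(license, "MIT") ~ "MIT",
    TRUE ~ "Other"
  )
}

bin_license(c("GPL (>= 2)", "GPL-3 | file LICENSE", "MIT + file LICENSE"))
# "GPL-2" "GPL-3" "MIT"
```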
Which domains do package authors typically use? My guess is GitHub rules them all, but is that true? Can we see any rise of other offerings like GitLab or BitBucket?
ov_dt <- ov_dt |>
  mutate(
    url_domain = map_chr(
      url,
      ~ if (is.na(.x)) NA_character_ else url_parse(.x)$domain
    ),
    bug_domain = map_chr(
      bug_reports,
      ~ if (is.na(.x)) NA_character_ else url_parse(.x)$domain
    )
  )
glimpse(ov_dt, 100)

Rows: 18,387
Columns: 29
$ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
$ num_dep <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
$ num_imports <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
$ num_authors <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
$ year <dbl> 2015, 2020, 2019, 2021, 2019, 2022, 2015, 2016, 2019, 2017, 2022, 2017, …
$ month <ord> Aug, Jun, Sep, Dec, Jun, May, May, Oct, Nov, Mar, May, Nov, Feb, May, Ju…
$ day <int> 16, 14, 20, 14, 25, 19, 5, 20, 13, 13, 28, 6, 4, 28, 17, 3, 20, 30, 22, …
$ wday <ord> Sun, Sun, Fri, Tue, Tue, Thu, Tue, Thu, Wed, Mon, Sat, Mon, Thu, Thu, Tu…
$ yr_mon <chr> "2015-Aug", "2020-Jun", "2019-Sep", "2021-Dec", "2019-Jun", "2022-May", …
$ dt <date> 2015-08-01, 2020-06-01, 2019-09-01, 2021-12-01, 2019-06-01, 2022-05-01,…
$ len_title <int> 9, 9, 8, 3, 8, 6, 8, 6, 9, 3, 5, 7, 7, 7, 4, 5, 5, 3, 4, 4, 5, 3, 8, 11,…
$ len_desc <int> 28, 24, 40, 21, 52, 36, 11, 32, 41, 163, 19, 56, 33, 89, 12, 25, 88, 65,…
$ license_cleaned <chr> "GPL-2", "GPL-3", "GPL-3", "GPL-3", "MIT", "GPL-3", "GPL-3", "GPL-3", "G…
$ url_domain <chr> NA, NA, "shiny.abdn.ac.uk", "github.com", "github.com", NA, NA, NA, NA, …
$ bug_domain <chr> NA, "github.com", NA, NA, "github.com", NA, NA, NA, NA, NA, "github.com"…
Now that I have the data sets prepared and ready, it’s time for the fun part - being creative and creating some interesting visuals! Let’s attack those questions one at a time.
Q: How have the dependencies & imports changed over time?
Taking the last published year-month combo for the active packages in the repo, I can calculate the median values for dependencies and imports. Medians will be robust against outliers, while also conveniently giving us whole numbers.
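A toy illustration of the robustness claim (made-up numbers, not from the dataset): a single package with an extreme dependency count drags the mean far more than the median.

```r
x <- c(1, 2, 2, 3, 250)  # hypothetical dependency counts; one outlier
mean(x)    # 51.6
median(x)  # 2
```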
deps <- ov_dt |>
select(year, len_desc, len_title) |>
arrange(-year) |>
filter(!is.na(year), year > 2008) |>
mutate(year = factor(year, levels = seq(2008, 2022)))
deps |>
pivot_longer(-year) |>
ggplot(aes(y = year, x = value, fill = name)) +
stat_density_ridges(
bandwidth = 4,
jittered_points = F,
position = position_points_jitter(height = 0),
point_shape = "|",
point_size = 2,
size = 0.25,
scale = .95,
quantile_lines = TRUE,
quantiles = 2,
alpha = 0.7,
rel_min_height = 0.01) +
scale_x_continuous(limits = c(0,200), expand = c(0,0)) +
coord_cartesian(clip = "off") +
theme_ridges(center = TRUE)

ov_dt |>
group_by(dt) |>
summarise_at(vars(len_title, len_desc), list(median = median, sd = sd), na.rm = TRUE) |>
ggplot(aes(x= dt)) +
geom_jitter(aes(y = len_title_median, color = "len_title_median"), alpha = 0.2) +
geom_smooth(aes(y = len_title_median, color = "len_title_median"), span = 0.3, se = FALSE) +
geom_jitter(aes(y = len_desc_median, color = "len_desc_median"), alpha = 0.2) +
geom_smooth(aes(y = len_desc_median, color = "len_desc_median"), span = 0.3, se = FALSE) # theme_light()

ov_dt |>
filter(year %in% c(2022, 2020, 2018)) |>
ggplot() +
geom_density(aes(x = len_desc,
fill = as.factor(year),
color = as.factor(year)
),
alpha = 0.3
)

ov_dt |>
filter(year %in% c(2022, 2020, 2018)) |>
ggplot() +
geom_histogram(aes(x = len_desc,
fill = as.factor(year),
color = as.factor(year)
),
alpha = 0.3
) +
facet_wrap(~year)

ov_dt |>
ggplot(aes(x= date_published, y = len_title)) +
geom_jitter(alpha = 0.05) +
geom_smooth(span = 0.1, se = FALSE) +
theme_light() +
scale_y_log10()
ov_dt |>
ggplot(aes(x= date_published, y = len_desc)) +
geom_jitter(alpha = 0.05) +
geom_smooth(span = 0.2, se = FALSE) +
theme_light() +
scale_y_log10()

ov_dt |>
group_by(license_cleaned) |>
count() |>
ggplot(aes(x = forcats::fct_reorder(license_cleaned, n), y = n, fill = license_cleaned)) +
geom_col() +
coord_flip() +
theme_minimal() +
guides(fill = "none") +
labs(x = "", y = "")
plot_bubbles <- function(dat,
.scale,
plot_radius,
bubble_radius,
alpha,
maxiter) {
.qty <- nrow(dat)
theta <- seq(0, 360, length.out = .qty + 1)
dat$x <- plot_radius * cos(theta * pi / 180)[-1]
dat$y <- plot_radius * sin(theta * pi / 180)[-1]
dat$n_scaled <- round(dat$n / .scale)
xpack <- rep(dat$x, times = dat$n_scaled)
ypack <- rep(dat$y, times = dat$n_scaled)
coords <- tibble(
x = xpack + runif(length(xpack)),
y = ypack + runif(length(ypack)),
r = bubble_radius
)
packed_coords <-
circleRepelLayout(coords, sizetype = "r", maxiter = maxiter)
packed_coords$layout |>
ggplot(aes(x, y)) +
geom_point(aes(size = radius), alpha = alpha) +
coord_equal() +
theme_minimal() +
theme(
legend.position = "none",
panel.grid = element_blank(),
axis.title = element_blank(),
axis.text = element_blank()
) +
geom_text(
aes(
x = x,
y = y,
label = label
),
data = dat,
hjust = "center",
vjust = "center"
)
}
ov_dt |>
count(license_cleaned) |>
top_n(6, n) |>
mutate(label = sprintf("%s\n%d Pkgs", license_cleaned, n)) |>
arrange(runif(1:n())) |>
plot_bubbles(
.scale = 100,
plot_radius = 10,
bubble_radius = 0.56,
alpha = 0.2,
maxiter = 1000
)
ov_dt |>
group_by(dt) |>
count(license_cleaned) |>
mutate(license_cleaned = forcats::fct_reorder(license_cleaned, n)) |>
ggplot(aes(x= dt, y = n, color = license_cleaned)) +
# geom_line( alpha = 0.3) +
geom_jitter(alpha = 0.3) +
geom_smooth(span = 0.3, se = FALSE) +
theme_light()

ov_dt |>
group_by(dt) |>
count(url_exist = is.na(url)) |>
ggplot(aes(x= dt, y = n, color = url_exist)) +
geom_jitter(alpha = 0.3) +
geom_smooth(span = 0.3, se = FALSE) +
theme_light()

ov_dt |>
group_by(dt) |>
count(url_exist = is.na(bug_reports)) |>
ggplot(aes(x= dt, y = n, color = url_exist)) +
geom_jitter(alpha = 0.3) +
geom_smooth(span = 0.3, se = FALSE) +
theme_light()

ov_dt |>
  filter(url_domain != "") |>
  mutate(url_domain = forcats::fct_lump_min(url_domain, 20)) |>
  group_by(dt) |>
  count(url_domain) |>
  ggplot(aes(x = dt, y = n, color = url_domain)) +
  geom_jitter(alpha = 0.3) +
  geom_smooth(span = 0.5, se = FALSE) +
  theme_light()

ov_dt |>
  filter(url_domain != "") |>
  mutate(url_domain = forcats::fct_lump_min(url_domain, 20)) |>
  count(url_domain) |>
  mutate(label = sprintf("%s\n%d", url_domain, n)) |>
  arrange(runif(1:n())) |>
  plot_bubbles(
    .scale = 50,
    plot_radius = 6,
    bubble_radius = 0.4,
    alpha = 0.2,
    maxiter = 1000
  )

ov_dt |>
filter(!is.na(dt), dt < "2022-07-01") |>
count(dt) |>
arrange(dt) |>
timetk::pad_by_time(dt, .by = "month", .pad_value = 0) |>
ggplot(aes(dt, n)) +
geom_line()
ov_dt |>
filter(!is.na(dt), dt < "2022-07-01") |>
count(dt) |>
arrange(dt) |>
timetk::pad_by_time(dt, .by = "month", .pad_value = 0) -> xdat
timetk::plot_seasonal_diagnostics(xdat, dt, log(n), .interactive = FALSE)
timetk::plot_stl_diagnostics(xdat |> filter(dt > "2018-01-01", dt < "2022-07-01"), dt, n, .interactive = FALSE, .feature_set = c("observed", "season", "trend", "remainder"))

frequency = 12 observations per 1 year
trend = 12 observations per 1 year